Abstract

Feature selection is an important pre-processing step for many pattern classification tasks. Tra- 
ditionally, feature selection methods are designed to obtain a feature subset that can lead to high 
classification accuracy. However, classification accuracy has recently been shown to be an inap- 
propriate performance metric of classification systems in many cases. Instead, the Area Under 
the receiver operating characteristic Curve (AUC) and its multi-class extension, MAUC, have 
been proved to be better alternatives. Hence, the target of classification system design is grad- 
ually shifting from seeking a system with the maximum classification accuracy to obtaining a 
system with the maximum AUC/MAUC. Previous investigations have shown that traditional fea- 
ture selection methods need to be modified to cope with this new objective. These methods most 
often are restricted to binary classification problems only. In this study, a filter feature selection 
method, namely MAUC Decomposition based Feature Selection (MDFS), is proposed for multi- 
class classification problems. To the best of our knowledge, MDFS is the first method specifically 
designed to select features for building classification systems with maximum MAUC. Extensive 
empirical results demonstrate the advantage of MDFS over several compared feature selection 
methods.

1. Introduction

Feature selection is an important data pre-processing technique in the machine learning and
data mining community [?]. By selecting a feature subset from the original feature set,
the time and storage requirements of classification tasks are reduced. In addition, reducing the
number of features may facilitate data visualization and understanding, or even improve the
performance of classification systems [2]. Generally speaking, feature selection methods can be
divided into two categories, i.e., filter and wrapper [4]. In a filter method, the whole selection
procedure is conducted solely based on the data set. On the other hand, a wrapper method
employs the classifier that will be used in the classification task afterward to evaluate the merit
of each candidate feature subset. It is well known that, in general, filter methods are more
computationally efficient, while wrapper methods will lead to better classification performance
for the specific classifier [?]. In recent years, with the emergence of many large-scale problems
that may involve thousands of features (e.g., gene expression [?] and text classification [?]),
the efficiency of feature selection methods has become of greater concern to both researchers
and practitioners. Therefore, filter methods, although sometimes leading to inferior classification
performance, are attracting increasing interest.

Email: {wmill08@mail.ustc.edu.cn, ketang@ustc.edu.cn}
Preprint submitted to Elsevier
January 21, 2013

In filter methods, since no classifier is used to evaluate candidate feature subsets, alternative
metrics are needed to evaluate their utility. For the sake of computational efficiency,
many so-called feature ranking methods evaluate the utility of individual features, and pick out
the top ones. Fisher's ratio [?], Pearson's correlation coefficient [?], Chi-square [10], information
gain [11, 15], symmetrical uncertainty [13] and distance discriminant [14] have all been utilized
as metrics for this purpose. In addition to measuring the utility of individual features, the Relief
methods [15] [16] further take the interaction between features and local characteristics of the
sample space into consideration, and thus are more likely to obtain a good feature subset rather
than a set of good individual features. However, all these methods may suffer from selecting
redundant features that provide no additional information but cause more computation time for
classification. To address this disadvantage, more recent filter methods, such as the minimal
Redundancy Maximal Relevance (mRMR) [17] and Fast Correlation Based Filter (FCBF) [18]
methods, are equipped with schemes to exclude redundant features. Since these methods usually
involve calculating the relevance between pairs of features, their major drawback is a much
higher computational cost.

A good feature selection method should not only be efficient, but also guarantee high
classification performance. Traditionally, this means that the feature selection process should select a
feature subset that leads to high classification accuracy. Most state-of-the-art methods, including
those mentioned above, have been demonstrated to be effective with regard to this objective.
However, recent progress in the machine learning area and new application domains have
revealed that accuracy itself is not necessarily a good performance metric [19]. First, using
accuracy assumes that the prior probabilities of different classes in the data sets are approximately
equal, which is not the case in many real-world applications (such as imbalanced learning
problems [20]). Second, using accuracy assumes that different types of misclassifications induce the
same cost, which does not hold in many real-world applications (such as cost sensitive learning
problems [21]). To address these shortcomings of accuracy, two alternative metrics, called Area
Under the receiver operating characteristic Curve (AUC) [22] [23] and its multi-class extension,
named MAUC [24], have been introduced in recent years. Specifically, AUC is used to evaluate
binary classifiers and MAUC for multi-class ones. They measure the performance of classifiers
without making implicit assumptions about the prior probability of classes or the
misclassification costs. Therefore, classification systems with maximized AUC/MAUC can be more useful for
real-world problems that involve unequal, unknown or even changing class distribution and
misclassification costs [25]. Moreover, it has been theoretically proved that AUC is more powerful
than accuracy for discriminating classification systems, and extensive empirical studies showed
that a similar conclusion also holds for MAUC [?]. In other words, even for balanced data sets
that do not involve different costs, AUC and MAUC are still superior to accuracy in the sense
that they facilitate choosing the best classification system from a number of candidates.
Therefore, the aim of designing a classification system is now gradually shifting from maximizing the
accuracy of the system to maximizing its AUC or MAUC. Hereafter, we refer to classification
systems that are designed according to this new objective as AUC/MAUC-oriented classification
systems.

Given the difference between accuracy and AUC/MAUC, it is interesting to ask whether
traditional feature selection methods can cope with the new challenge raised by AUC/MAUC-oriented
classification systems. Recently, some initial studies have been carried out to address
this issue. In [26], AUC is employed to rank features directly. This work was then further
extended in [27]. Empirical studies have shown that these two methods, although derived from
traditional filter methods with minor modifications, significantly outperform traditional methods.
This observation is not unexpected since it has been stated that a successful feature selection
method should consider the objective of the classification systems [?]. However, both of the
above methods focus only on binary classification problems. To the best of our knowledge, no
work has been published in the literature to address multi-class classification problems. Yet,
multi-class problems are very common in practice and there is a need for suitable MAUC-oriented
feature selection methods. We therefore propose in this paper a novel feature selection method
for MAUC-oriented classification systems.

The proposed method, namely MAUC Decomposition based Feature Selection (MDFS), is
in essence a filter method. In MDFS, a multi-class problem is first divided into a batch of binary
class sub-problems in one-versus-one manner (i.e., each sub-problem consists of a pair of classes
[?]). After that, AUC is used to rank all features within each sub-problem. Thus, a feature
ranking list can be obtained for each sub-problem. Finally, the sub-problems are visited iteratively.
Every time a sub-problem is considered, one feature is picked out from the corresponding
feature ranking list. In this way, the "siren pitfall" phenomenon [30] that is usually encountered
in multi-class feature selection is avoided. Extensive empirical studies have been conducted to
compare MDFS to 8 other feature selection methods on 8 multi-class data sets with 4 different
types of classifiers. The results clearly demonstrated the superiority of MDFS.

The rest of the paper is organized as follows: Related feature selection methods are briefly 
reviewed in Section 2. In Section 3, AUC and MAUC are introduced. After that, MDFS is 
described in detail in Section 4. Section 5 presents the experimental setup and results. Finally, 
conclusions and discussions are given in Section 6. 

2. Feature Selection Methods Revisited 

In this section, we will review the filter methods that are closely related to our work, including
three feature ranking methods, the SpreadFx approach proposed in [30], the ReliefF method [16],
which is the multi-class extension of the Relief method, and the minimal Redundancy Maximal
Relevance (mRMR) method [17].

The main notations used in this paper are summarized as follows. $D = \{x_1, x_2, \ldots, x_n\}$ is the
training data set, where $x_i$ ($1 \le i \le n$) is an instance. $F = \{f_1, f_2, \ldots, f_m\}$ is the original feature
set, where $f_i$ ($1 \le i \le m$) is a feature. In addition, the class variable is denoted by $y$, whose value
can be one of $\{1, 2, \ldots, c\}$ for each instance.

2.1. Feature Ranking methods 

Feature ranking methods score each feature individually according to a pre-defined criterion.
Then the top K (a user-defined number) features with the largest scores will be selected. Methods 
of this category are very popular due to their high computational efficiency and simplicity. In the 
following, we will briefly revisit three popular feature ranking methods.

2.1.1. Feature Ranking Based on Chi-Square

The feature ranking method based on the Chi-square metric [10] utilizes a very simple
selection strategy. For a nominal feature $f_i$, its Chi-square statistic CHI with respect to the class
variable $y$ can be calculated as follows,

$$\mathrm{CHI}(f_i) = \sum_{j} \sum_{k=1}^{c} \frac{(O_{jk} - E_{jk})^2}{E_{jk}} \tag{1}$$

where $O_{jk}$ is the number of instances with feature value $f_i = f_{ij}$ ($f_{ij}$ is a possible value of feature
$f_i$) and $y = k$, and

$$E_{jk} = \frac{O_{j\cdot} \times O_{\cdot k}}{n} \tag{2}$$

where $O_{j\cdot}$ is the number of instances with feature value $f_i = f_{ij}$, and $O_{\cdot k}$ is the number of instances
with $y = k$.
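
As a concrete illustration, the Chi-square score of a single nominal feature can be computed directly from the counts defined above. The following Python sketch assumes the standard Pearson form of the statistic, i.e., the sum of $(O_{jk} - E_{jk})^2 / E_{jk}$ over all pairs of feature value and class; the function name `chi_square_score` and the toy data are hypothetical and are not part of the original study.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Chi-square statistic of one nominal feature against the class variable."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # O_jk
    value_counts = Counter(feature_values)           # O_j. (row totals)
    class_counts = Counter(labels)                   # O_.k (column totals)
    score = 0.0
    for value, o_j in value_counts.items():
        for k, o_k in class_counts.items():
            expected = o_j * o_k / n                 # E_jk
            score += (observed.get((value, k), 0) - expected) ** 2 / expected
    return score


# Toy data (hypothetical): one feature mirrors the class, the other ignores it.
labels = [1, 1, 2, 2]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]
print(chi_square_score(informative, labels), chi_square_score(uninformative, labels))
```

A feature that is statistically independent of the class obtains a score of zero, so ranking by this score in decreasing order places class-dependent features first.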

2.1.2. Feature Ranking Based on Symmetrical Uncertainty 

Instead of the Chi-square statistic, this method uses the symmetrical uncertainty [13] statistic
SU to rank features. As a variant of mutual information, symmetrical uncertainty avoids the bias
of mutual information to features with many distinct values and lies in the range [0,1]. For a
nominal feature $f_i$ and the class variable $y$, SU between them is calculated as,

$$SU(f_i, y) = \frac{2 \times I(f_i, y)}{H(f_i) + H(y)} \tag{3}$$

where

$$H(f_i) = -\sum_{j} p(f_i = f_{ij}) \log p(f_i = f_{ij}) \tag{4}$$

$$H(y) = -\sum_{k=1}^{c} p(y = k) \log p(y = k) \tag{5}$$

are the entropies of $f_i$ and $y$ respectively, and

$$I(f_i, y) = H(y) - H(y \mid f_i) \tag{6}$$

denotes the mutual information between feature $f_i$ and class variable $y$. Furthermore,

$$H(y \mid f_i) = -\sum_{j} \sum_{k=1}^{c} p(f_i = f_{ij}, y = k) \log p(y = k \mid f_i = f_{ij}) \tag{7}$$

is the conditional entropy of $y$ given $f_i$.
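
The quantities above can be estimated from empirical frequencies. The sketch below is a minimal Python illustration under two assumptions that the text does not fix: probabilities are estimated by relative frequencies, and the mutual information is obtained through the identity $I(f_i, y) = H(f_i) + H(y) - H(f_i, y)$, which equals $H(y) - H(y \mid f_i)$. The function names are hypothetical. Since SU is a ratio of entropies, the base of the logarithm does not affect its value.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(feature_values, labels):
    """SU between a nominal feature and the class variable, in [0, 1]."""
    h_f = entropy(feature_values)
    h_y = entropy(labels)
    h_joint = entropy(list(zip(feature_values, labels)))
    mutual_info = h_f + h_y - h_joint  # same value as H(y) - H(y | f)
    if h_f + h_y == 0:
        return 0.0  # both variables are constant; no information to share
    return 2.0 * mutual_info / (h_f + h_y)


labels = [1, 1, 2, 2]
print(symmetrical_uncertainty(["a", "a", "b", "b"], labels))  # feature mirrors the class
print(symmetrical_uncertainty(["a", "b", "a", "b"], labels))  # feature independent of the class
```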

2.1.3. Feature Ranking Based on Distance Discriminant 

To select features that can separate different classes while keeping the instances in the same class
close to one another, Feature Selection based on Distance Discriminant (FSDD) was proposed
in [14]. FSDD calculates the utility of feature $f_i$ according to Eq. (8),
where $\beta$ is a tuning parameter (by default, $\beta = 2$), and $n_k$ is the number of instances in the $k$-th class.

(9)

are the variance and mean of feature $f_i$ over all instances in data set $D$. Here, $x_j^i$ is the value of
feature $f_i$ for instance $x_j$.

(10)

are the variance and mean of feature $f_i$ over all instances that belong to the $k$-th class.

(11)

is the weighted variance of the feature $f_i$ over the $c$ different classes.



2.2. SpreadFx 

To overcome the "siren pitfall" phenomenon that affects traditional feature ranking
methods in multi-class problems (we will detail this issue in Section 4.1), the SpreadFx [30] feature
selection approach was proposed. Generally speaking, SpreadFx methods first decompose the
multi-class problem into c binary sub-problems in one-versus-all manner (i.e., each sub-problem
consists of one class as positive class and all the other classes jointly form a negative class).
Then, a feature ranking list will be obtained on each sub-problem. Finally, features are selected
by applying some dynamic scheduling policy to the ranking lists. Two key components need
to be specified when employing a SpreadFx type method: the feature ranking method and the
dynamic scheduling policy. In practice, the former can be any feature ranking method (for
example, any of the three methods described above). As for the latter, it has been shown in [30]
that selecting sub-problems one by one in turn (the Round-Robin policy) performed satisfactorily.
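
The Round-Robin policy can be sketched as follows. This Python fragment is only an illustration: it assumes that each sub-problem already has a ranking list (best feature first) and that a sub-problem whose best remaining feature has already been selected simply moves on to its next-best one. The function name `round_robin_select` is hypothetical.

```python
def round_robin_select(ranking_lists, k):
    """Pick up to k distinct features by visiting the ranking lists in turn."""
    selected = []
    positions = [0] * len(ranking_lists)  # next unread position in each list
    while len(selected) < k and any(
        p < len(r) for p, r in zip(positions, ranking_lists)
    ):
        for idx, ranking in enumerate(ranking_lists):
            # Skip features that an earlier sub-problem has already contributed.
            while positions[idx] < len(ranking) and ranking[positions[idx]] in selected:
                positions[idx] += 1
            if positions[idx] < len(ranking):
                selected.append(ranking[positions[idx]])
                positions[idx] += 1
            if len(selected) == k:
                break
    return selected


# Three one-versus-all sub-problems with hypothetical feature rankings.
rankings = [["a", "b", "c"], ["a", "d", "e"], ["f", "a", "b"]]
print(round_robin_select(rankings, 4))
```

Because every sub-problem contributes in turn, a feature that is only useful for one class still has a chance to be selected before the budget of K features is exhausted.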



2.3. ReliefF 

The Relief methods first calculate a weight for each feature. Then they select the features
with the largest weights [15]. Different from feature ranking methods, the weights of features
are calculated in an iterative way. The calculation is based on the assumption that instances
belonging to the same class and close to one another should have similar values on a useful feature,
while instances that are close to one another but are from different classes should have quite
different values. At each iteration, Relief methods choose one instance and its nearest neighbors
in each class. Then, the differences between this instance and its neighbors on every feature are
employed to update the weights of features. This procedure is repeated for a pre-defined number
of iterations. As an extension of the original Relief method, ReliefF was proposed to tackle
multi-class problems and is more robust to noisy and missing values in data sets [16].

2.4. Minimal Redundancy Maximal Relevance Method

Some recent filter feature selection methods are equipped with schemes to explicitly exclude
redundant features. A representative method among them is the minimal Redundancy Maximal
Relevance (mRMR) method [17]. Specifically, mRMR first evaluates the relevance of each feature
based on its mutual information (Eq. (6)) with the class variable, then the feature with the largest
relevance score is selected in the first iteration. After that, features are selected one at a time
according to the following criterion:

$$f_{selected} = \arg\max_{f_i \in F - S} \left[ I(f_i, y) - \frac{1}{|S|} \sum_{f_j \in S} I(f_i, f_j) \right] \tag{12}$$

where $f_{selected}$ denotes the feature being selected at each iteration, $S$ is the currently selected
feature subset, and $|S|$ is its cardinality. The algorithm terminates when a pre-defined number of
features have been selected. Since mRMR detects the redundancy among features by calculating
the mutual information between pairs of features, it has a higher computational complexity in
comparison to other filter methods.
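
The greedy selection described in this subsection can be sketched in a few lines. The Python fragment below assumes that the relevance values $I(f_i, y)$ and the pairwise redundancy values $I(f_i, f_j)$ have already been computed and are passed in as dictionaries; the names `mrmr_select`, `relevance` and `redundancy` as well as the numbers in the example are hypothetical.

```python
def mrmr_select(relevance, redundancy, k):
    """Greedy mRMR: relevance minus mean redundancy with the selected set.

    relevance:  {feature: I(feature, y)}
    redundancy: {frozenset({f, g}): I(f, g)} for every pair of features
    """
    if k <= 0 or not relevance:
        return []
    remaining = sorted(relevance)  # sorted only to make tie-breaking repeatable
    first = max(remaining, key=lambda f: relevance[f])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:
        def criterion(f):
            mean_red = sum(redundancy[frozenset((f, s))] for s in selected) / len(selected)
            return relevance[f] - mean_red
        best = max(remaining, key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected


# f2 is almost as relevant as f1 but nearly duplicates it, so f3 is preferred.
relevance = {"f1": 0.90, "f2": 0.85, "f3": 0.50}
redundancy = {
    frozenset(("f1", "f2")): 0.80,
    frozenset(("f1", "f3")): 0.10,
    frozenset(("f2", "f3")): 0.10,
}
print(mrmr_select(relevance, redundancy, 2))
```

The inner sum over the selected set is what makes the procedure more expensive than plain feature ranking: every candidate has to be compared with every feature chosen so far.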



3. Area Under the ROC Curve and its Multi-class Extension 

3.1. Area Under the ROC Curve 

Assume that a data set consists of $n$ instances, with $P$ of them belonging to class 1 (positive
class) and $N$ of them belonging to class 2 (negative class). AUC can be used to evaluate the
performance of a classification system on this binary class data set. Generally speaking, most
mainstream classifiers allow assigning a numerical score (e.g., probability of the instance
belonging to the positive class) to each of the $n$ instances after training. Then, the AUC value of
this classifier can be calculated based on these $n$ scores and the corresponding true class labels.
To be specific, the AUC value of a classifier equals the probability that a randomly chosen
positive instance will be assigned a larger score than a randomly chosen negative instance [31].
As an example, given a classifier whose AUC value is 0.85 on data set $D$, for a randomly chosen
positive instance $x_1$ and a randomly chosen negative instance $x_2$ from $D$, the probability
that $x_1$ will get a higher score than $x_2$ is 0.85.

Let a classifier $h(x_i) \to \mathbb{R}$ output a numerical score to indicate the confidence that $x_i$ belongs
to the positive class. AUC can be calculated as,

$$AUC = \frac{\sum_{x_i \in class(1),\, x_j \in class(2)} s(x_i, x_j)}{P \times N} \tag{13}$$

where $s(x_i, x_j)$ is defined as:

$$s(x_i, x_j) = \begin{cases} 1 & \text{if } h(x_i) > h(x_j); \\ 0.5 & \text{if } h(x_i) = h(x_j); \\ 0 & \text{if } h(x_i) < h(x_j). \end{cases} \tag{14}$$
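
The pairwise definition of AUC translates directly into code. The following Python sketch counts, over all positive/negative pairs, how often the positive instance receives the larger score, with ties counted as one half; the function name `auc` and the scores in the example are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    total = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                total += 1.0   # positive instance ranked above the negative one
            elif sp == sn:
                total += 0.5   # ties count as half
    return total / (len(pos_scores) * len(neg_scores))


# 7.5 of the 9 pairs are ordered correctly, so the AUC is about 0.833.
print(auc([0.9, 0.8, 0.4], [0.7, 0.4, 0.1]))
```

This direct form costs $P \times N$ comparisons; it is convenient for checking small examples, while larger data sets are usually handled with a sort-based computation.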

3.2. MAUC

Although AUC has been studied extensively in the literature, it is only applicable to binary
classification problems. A simple extension of AUC which can be used in multi-class problems
was proposed in [24]. For a multi-class classification problem containing $c$ classes, a classifier
assigns $c$ scores to every instance. Each score corresponds to one of the $c$ classes and indicates
the confidence that the instance belongs to this class. Hence, the scores for all the $n$ instances can
be represented by an $n$-by-$c$ matrix, the columns and rows of which correspond to classes and
instances respectively. Let $A_{i,j}$ ($A_{j,i}$) be the AUC calculated according to the $i$-th ($j$-th) column of
the matrix with respect to the instances from class $i$ and class $j$. MAUC is defined as follows,

$$MAUC = \frac{2}{c(c-1)} \sum_{i=1}^{c} \sum_{j=i+1}^{c} \frac{A_{i,j} + A_{j,i}}{2} \tag{15}$$

From Eq. (15), we can see that the MAUC value of a classification system is actually the
average AUC value over all $\frac{c(c-1)}{2}$ one-versus-one binary sub-problems. In other words, this
means that maximizing the AUC value of each binary sub-problem is of equal importance in
calculating a classification system's MAUC.
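
To make the averaging over class pairs explicit, the following Python sketch computes MAUC from an $n$-by-$c$ score matrix. It is a minimal illustration that assumes, for each pair of classes, that the two directional AUC values are averaged and that all pairs are weighted equally; the function names and the toy score matrices are hypothetical.

```python
from itertools import combinations


def pairwise_auc(pos_scores, neg_scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos_scores
        for q in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def mauc(score_matrix, labels, classes):
    """score_matrix[r][k] is the score of instance r for the class classes[k]."""
    pairs = list(combinations(range(len(classes)), 2))
    total = 0.0
    for i, j in pairs:
        rows_i = [r for r, lab in enumerate(labels) if lab == classes[i]]
        rows_j = [r for r, lab in enumerate(labels) if lab == classes[j]]
        # A_ij: column i is used, class i plays the role of the positive class.
        a_ij = pairwise_auc([score_matrix[r][i] for r in rows_i],
                            [score_matrix[r][i] for r in rows_j])
        # A_ji: column j is used, class j plays the role of the positive class.
        a_ji = pairwise_auc([score_matrix[r][j] for r in rows_j],
                            [score_matrix[r][j] for r in rows_i])
        total += (a_ij + a_ji) / 2.0
    return total / len(pairs)


labels = [0, 0, 1, 1, 2, 2]
perfect = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
constant = [[0.3, 0.3, 0.3]] * 6
print(mauc(perfect, labels, [0, 1, 2]), mauc(constant, labels, [0, 1, 2]))
```

A classifier that scores every instance identically obtains 0.5 on each sub-problem and therefore an MAUC of 0.5, whereas perfect separation of every class pair is needed to reach 1.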



4. Feature Selection for MAUC-Oriented Classification Systems 

Given the difference between accuracy and MAUC, the question that is addressed in this
study is: How can the feature selection process facilitate the construction of MAUC-oriented
classification systems? In general, three methodologies can be employed for this purpose. First,
traditional feature selection methods can be applied directly. Second, in analogy to [26], MAUC
can be used as a relevance metric to rank features. Third, one may also develop a novel feature
selection method. We will start by considering the efficacy of the former two methodologies.
Then, the MDFS method, which belongs to the third methodology, will be described in detail.

4.1. Using Traditional Feature Selection Methods Directly 

As mentioned before, previous work showed that traditional feature selection methods may
be easily outperformed by new methods specifically designed for AUC-oriented classification
systems on binary problems. It is natural to expect that such a situation also holds for multi-class
problems, although solid evidence is absent in the literature. In fact, traditional feature selection
methods might be unsuitable for MAUC-oriented classification systems not only because of the
difference between accuracy and MAUC, but also due to the so-called "siren pitfall" phenomenon
that can occur in traditional feature ranking methods: Since the difficulties of separating different
classes are usually different in multi-class problems, features that perform well on easy
sub-problems (i.e., the sub-problems consisting of classes that are easy to separate) usually obtain
relatively higher utility scores, while features that perform well on difficult sub-problems usually
obtain lower scores. As a result, when these features are compared to each other, those which
perform well on easy sub-problems are more likely to be selected [30]. In consequence, easy
sub-problems will be focused on more than difficult sub-problems. However, when calculating
the MAUC value of a classification system, it is equally important to maximize the AUC value
of every sub-problem. Hence, the "siren pitfall" phenomenon of feature selection methods may
also degrade the performance of MAUC-oriented classification systems.

In the literature, the "siren pitfall" phenomenon was only reported for text classification
systems whose performance was measured by Precision, F-measure, and Recall [30]. To verify
whether it also exists in MAUC-oriented classification systems, we carried out a case study on the
Indiana data set [32]. This hyperspectral imagery data set is a section of scenes taken over
northwest Indiana's Indian Pines by the airborne visible/infrared imaging spectrometer (AVIRIS), and
consists of 9 classes and 220 features. Similar to the experimental setup in [30], the Naive Bayes
classifier was applied to every one-versus-one sub-problem using all the features. Two feature
selection methods, namely feature ranking based on the Chi-square metric (CHI) and feature
ranking based on the symmetrical uncertainty metric (SU), were employed to select 100 features
on the whole data set (global best features). Then, they were applied to each sub-problem
separately to rank all the 220 features. The higher the rank (let rank 1 be the highest rank), the more
useful the feature is for the corresponding sub-problem. If the global best features work well on a
sub-problem, they will get high ranks on it. Fig. 1 presents the AUC obtained by the Naive Bayes
classifier on the 36 sub-problems. Figs. 2 and 3 illustrate the ranks of the global best 100 features
on each sub-problem. It can be observed that the global best features selected by both CHI and
SU got rather low ranks on those difficult sub-problems (i.e., the sub-problems corresponding
to smaller AUC in Fig. 1). For example, for the 34th sub-problem, most of the 100 global best
features were not within the top features. This observation suggests that the "siren pitfall" should
also be taken care of for MAUC-oriented systems. In [30], SpreadFx was proposed to cope with
the "siren pitfall" phenomenon. However, it was not designed for MAUC-oriented classification
systems and may not suit the aim of MAUC maximization well. Hence, new feature selection
methods need to be developed.

Figure 1: Difficulty of the binary classification sub-problems of the Indiana data set. The larger the AUC, the easier the
corresponding sub-problem is.

Figure 2: Ranks of the top 100 global best features on the 36 sub-problems. The 100 features are selected by the CHI
method.

Figure 3: Ranks of the top 100 global best features on the 36 sub-problems. The 100 features are selected by the SU method.

4.2. Using MAUC as Relevance Metric Directly

Since the value of a feature on each instance can be interpreted as the output of a single 
feature classifier, a straightforward feature selection method for MAUC-oriented classification 
systems is using MAUC as the relevance metric to rank features directly (referred to as MAUCD 
hereafter). The MAUC score of a feature can be calculated in two steps: First, decompose the
multi-class problem into a batch of one-versus-one binary class sub-problems, and then calculate
the AUC score of this feature on every sub-problem. Second, calculate this feature's MAUC
score using Eq. (15). In this method, the utility of a feature is measured by averaging its utility
over all sub-problems, and thus quite a few features that favor easy sub-problems will get large 
MAUC scores. Consequently, these features are more likely to be selected than those features 
that are more useful for difficult sub-problems. In other words, directly using MAUC to rank 
and select features can induce the "siren pitfall" phenomenon as well, and thus might not be an 
ideal solution for MAUC-oriented classification systems. Furthermore, it is likely that different 
features may be useful for different sub-problems, and we can anticipate that conducting feature 
selection on every sub-problem separately and collecting all the obtained feature subsets will 
form a feature subset that yields good performance. Therefore, instead of the direct use of MAUC 
as feature ranking metric, we design the MAUC Decomposition based Feature Selection (MDFS) 
method. 
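
The averaging effect discussed in this subsection can be reproduced on a toy example. The Python sketch below scores a single feature by the mean of its AUC values over all one-versus-one sub-problems. It makes one assumption that the text does not state: since a feature may be either positively or negatively associated with a class, the larger of AUC and 1 - AUC is used on each sub-problem. The function names and the data are hypothetical.

```python
from itertools import combinations


def pairwise_auc(pos, neg):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def maucd_score(values, labels):
    """Mean AUC of one feature over all one-versus-one sub-problems."""
    pairs = list(combinations(sorted(set(labels)), 2))
    total = 0.0
    for ci, cj in pairs:
        vi = [v for v, lab in zip(values, labels) if lab == ci]
        vj = [v for v, lab in zip(values, labels) if lab == cj]
        a = pairwise_auc(vi, vj)
        total += max(a, 1.0 - a)  # assumption: the direction of the feature is ignored
    return total / len(pairs)


labels = [0, 0, 1, 1, 2, 2]
# Separates class 2 from the rest but cannot tell class 0 from class 1.
easy_feature = [1, 2, 1, 2, 9, 10]
# The only feature that separates class 0 from class 1.
hard_feature = [1, 2, 3, 4, 0, 5]
print(maucd_score(easy_feature, labels), maucd_score(hard_feature, labels))
```

In this example the first feature obtains the higher averaged score (about 0.83 versus 0.67), although only the second feature helps on the sub-problem of class 0 versus class 1. A ranking based on the averaged score alone would therefore leave that sub-problem without a useful feature if only one feature were selected.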

4.3. MAUC Decomposition based Feature Selection Method 

Given a data set $D$, each instance $x_i$ may belong to one of $c$ ($c > 2$) classes and is represented
by $m$ features. The MDFS method works as follows. First, $D$ is decomposed into $\frac{c(c-1)}{2}$
binary class sub-problems in one-versus-one manner (i.e., each sub-problem consists of a pair
of classes). Then, the features are ranked according to their AUC scores on every sub-problem.
This leads to $\frac{c(c-1)}{2}$ feature ranking lists. After that, feature selection is carried out iteratively.
In each iteration, a sub-problem is randomly chosen and the previously unselected feature with
the highest rank in the corresponding feature ranking list is moved to the selected feature subset.
Since the AUC score is used to rank features in every sub-problem, MDFS can only deal with
numerical, ordered and binary type features. Nominal features which take more than two possible
values need to be converted to appropriate numerical features before using MDFS. Algorithm 1
presents the main steps of MDFS.
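
For readers who prefer executable code, the procedure of this subsection can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments. It assumes that the AUC of a feature on a sub-problem is taken as the larger of AUC and 1 - AUC, that ties between features keep their original order, and that the random choice of sub-problems is driven by a seeded generator; the names `mdfs` and `seed` are hypothetical.

```python
import random
from itertools import combinations


def pairwise_auc(pos, neg):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mdfs(X, y, k, seed=0):
    """X: list of instances (lists of m numeric values); y: class labels."""
    rng = random.Random(seed)
    m = len(X[0])
    rank_lists = []
    # One ranking list per one-versus-one sub-problem.
    for ci, cj in combinations(sorted(set(y)), 2):
        rows_i = [x for x, lab in zip(X, y) if lab == ci]
        rows_j = [x for x, lab in zip(X, y) if lab == cj]

        def score(f):
            a = pairwise_auc([x[f] for x in rows_i], [x[f] for x in rows_j])
            return max(a, 1.0 - a)  # assumption: direction of the feature is ignored

        rank_lists.append(sorted(range(m), key=score, reverse=True))

    selected = []
    k = min(k, m)
    while len(selected) < k:
        ranking = rng.choice([r for r in rank_lists if r])  # random sub-problem
        f = ranking.pop(0)                                  # its best remaining feature
        if f not in selected:
            selected.append(f)
    return selected


# Feature 0 separates every pair of classes, feature 1 is constant,
# feature 2 is partially informative.
X = [[1, 0, 5], [2, 0, 3], [3, 0, 4], [4, 0, 6], [5, 0, 1], [6, 0, 2]]
y = [0, 0, 1, 1, 2, 2]
print(mdfs(X, y, 2))
```

Popping a feature that is already selected costs one iteration without enlarging the subset, which mirrors the check performed inside the selection loop of the method.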

Input : Multi-class data set D, a user-set feature subset size K
Output: A set S containing the selected features

S ← ∅;
for i ← 1 to c do
    for j ← i + 1 to c do
        get the binary sub-problem D_ij in one-versus-one manner;
        get the feature rank list L_ij according to the AUC scores of the features on D_ij;
    end
end
while |S| < K do
    select a sub-problem D_ij randomly;
    identify the best feature f from L_ij;
    pop f out of L_ij;
    if f ∉ S then
        put f into S;
    end
end
return S

Algorithm 1: The MDFS Feature Selection Algorithm

4.4. Computational Complexity of MDFS

In this subsection, the time complexity of MDFS is analyzed and compared to some other
existing feature selection methods. Let the number of instances in the $i$-th class of data set $D$ be
$n_i$. The complexity of calculating the AUC score of one feature on the sub-problem consisting of
the $i$-th and $j$-th classes is then $O((n_i + n_j) \log(n_i + n_j))$ [?]. Since MDFS requires calculating the
AUC scores of $m$ features on every sub-problem, its time complexity is:

$$O\left( m \sum_{i=1}^{c} \sum_{j=i+1}^{c} (n_i + n_j) \log(n_i + n_j) \right) \tag{16}$$

Since

$$m \sum_{i=1}^{c} \sum_{j=i+1}^{c} (n_i + n_j) \log(n_i + n_j) < m \sum_{i=1}^{c} \sum_{j=i+1}^{c} n \log n = \frac{c(c-1)}{2}\, mn \log n \tag{17}$$

and the class number $c$ is usually small in practice, the complexity of MDFS is roughly $O(mn \log n)$.
The main computational cost of MAUCD is also induced by calculating the AUC scores of
features on all binary sub-problems. Hence, the complexity of MAUCD is the same as that of
MDFS. According to [14], the time complexity of FSDD is $O(mn)$. The CHI and SU methods
are designed to deal with nominal features. For numerical features, a discretization procedure
is needed to convert the numerical features into nominal ones. A typical discretization method
requires sorting the numerical values first, and then scanning over the sorted values to convert an
interval of continuous values to a discrete value [?]. Hence, the complexity of these two feature
ranking methods in dealing with data sets consisting of numerical features is also $O(mn \log n)$.
Following this analysis, the complexity of the SpreadFx feature selection approach using one of these
feature ranking methods to rank features on every sub-problem is $O(cmn \log n)$. Again, omitting
the constant $c$, SpreadFx's complexity is $O(mn \log n)$ as well. The complexity of ReliefF is
$O(tmn)$, where $t$ is the number of iterations to update features' weights [16]. The configuration
of $t$ involves many factors and is a non-trivial task. If $t$ is set to $\log n$ as recommended
in [35], the complexity of ReliefF will also be $O(mn \log n)$. In addition, since mRMR detects
the redundancy among features by calculating the mutual information between pairs of features,
the time complexity of mRMR is $O(m^2 n \log n)$. To summarize, the computational complexity of
MDFS is comparable to that of existing filter feature selection methods.
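
The log-linear cost quoted for a single AUC score corresponds to a sort-based computation. The Python sketch below is one possible realization, assuming the well-known equivalence between AUC and the rank-sum (Mann-Whitney) statistic with average ranks for tied scores; the function name `auc_by_sorting` and the example scores are hypothetical.

```python
def auc_by_sorting(pos_scores, neg_scores):
    """AUC from the rank sum of the positive instances; the sort dominates the cost."""
    labeled = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    n = len(labeled)
    rank_sum = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and labeled[j][0] == labeled[i][0]:
            j += 1                                   # labeled[i:j] share one score
        avg_rank = (i + 1 + j) / 2.0                 # mean of the ranks i+1, ..., j
        rank_sum += avg_rank * sum(lab for _, lab in labeled[i:j])
        i = j
    p, q = len(pos_scores), len(neg_scores)
    return (rank_sum - p * (p + 1) / 2.0) / (p * q)


# Same value as the pairwise definition: 7.5 of the 9 pairs are ordered correctly.
print(auc_by_sorting([0.9, 0.8, 0.4], [0.7, 0.4, 0.1]))
```

After sorting, a single linear scan is enough, so the cost for a sub-problem with $n_i + n_j$ instances is dominated by the sort.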



5. Experiments 

Experimental studies have been carried out to evaluate the performance of MDFS. Our
experiments were designed based on three considerations. First, the efficacy of MDFS needs to be
verified against traditional filter feature selection methods. Second, since the focus of this work
is filter methods, the experiments should not be restricted to a specific type of classifier. Finally,
the experiments should be conducted on data sets with sufficient numbers of features.
Otherwise, it is not necessary to conduct feature selection at all. Having these considerations in mind,
9 filter methods (including MDFS) and 4 different types of classifiers have been selected for the
comparison. Altogether 36 combinations of feature selection methods and classifiers have been
applied to 8 multi-class data sets collected from various domains with at least 60 features.

Table 1: Summary of 8 benchmark data sets

Data Set      No. of Instances   No. of Features   No. of Classes
ISOLET        7797               617               26
MNIST         10000              784               10
USPS          7291               256               10
Phoneme       4509               256               5
Washington    11200              210               7
Indiana       9345               220               9
Synthetic     600                60                6
Thyroid       168                2000              4

5.1. Data Sets

Eight benchmark multi-class data sets from various domains were collected for our
experiments. The ISOLET data set [36], MNIST data set [37] and USPS data set [?] are handwriting
recognition problems. The Phoneme data set [38] is from the speech recognition field. The
Washington data set [39] and Indiana data set [32] are hyperspectral imagery data sets. More
details about these two data sets can be found in [40]. The Synthetic data set [36] is a synthetically
generated control chart data set and Thyroid [41] is a microarray data set. Table 1 summarizes
the information about these data sets.



5.2. Experimental Setup 

The three feature ranking methods introduced in Section 2.1, and ReliefF were picked as 
the baseline feature selection methods in our experiments. To compare with SpreadFx, we set 
CHI and SU separately as the feature ranking method in SpreadFx, and employed Round-Robin 
as the scheduling scheme. The resulting algorithms are referred to as SpreadFx [Round-Robin, 
CHI] (SCHI) and SpreadFx [Round-Robin, SU] (SSU) respectively. Besides, the mRMR feature 
selection method introduced in Section 2.4 and MAUCD introduced in Section 4.2 were also 
included for comparison. 

Since none of these methods can automatically decide how many features should be
selected in a given problem, we compared them on different feature subset sizes from 10 to 100
with an interval of 10. Since the Synthetic data set only consists of 60 features, 6 feature subset
sizes were considered on this data set.

In order to examine whether MDFS is biased towards a certain type of classifier, 4 different
types of classifiers were used in our experiments: 1-Nearest Neighbor (1NN) [42], C4.5 [43],
Naive Bayes [44], and SVM with RBF kernel function [45].

All the compared feature selection methods and classifiers were implemented on the WEKA
platform [46]. The number of iterations t used to update the feature weights in the ReliefF method was
set to log n, where n is the number of training instances. The parameters c and γ of SVM were set
to the values that maximize the average MAUC in a 3-fold cross-validation on the training data
set of each fold of the 10-fold cross-validation, where c was sampled at 2^-5, 2^-3, ..., 2^15, and γ was
sampled at 2^-15, 2^-13, ..., 2^3. All other parameters and implementations followed the default
configuration of WEKA.
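
The parameter selection for SVM can be sketched as a plain grid search. The exponent ranges below are an assumption (the widely used LIBSVM-style grid), and `inner_cv_mauc` is a hypothetical callable standing in for the inner 3-fold cross-validation; `toy_score` is a made-up scorer used only to make the example runnable.

```python
import math
from itertools import product


def power_of_two_grid(first_exp, last_exp, step=2):
    """Return 2**first_exp, 2**(first_exp + step), ..., up to 2**last_exp."""
    return [2.0 ** e for e in range(first_exp, last_exp + 1, step)]


# Assumed exponent ranges (LIBSVM-style grid); adjust to the actual setup.
C_GRID = power_of_two_grid(-5, 15)       # 11 candidate values for c
GAMMA_GRID = power_of_two_grid(-15, 3)   # 10 candidate values for gamma


def select_svm_params(inner_cv_mauc):
    """Pick the (c, gamma) pair with the highest inner-CV MAUC.

    inner_cv_mauc(c, gamma) is a hypothetical callable that trains an RBF SVM
    with the given parameters and returns the average MAUC over a 3-fold
    cross-validation on the training part of the current outer fold.
    """
    return max(product(C_GRID, GAMMA_GRID), key=lambda cg: inner_cv_mauc(*cg))


def toy_score(c, gamma):
    """Stand-in scorer for illustration only; peaks at c = 2**3, gamma = 2**-7."""
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 7)


if __name__ == "__main__":
    print(len(C_GRID) * len(GAMMA_GRID))   # number of (c, gamma) pairs tried
    print(select_svm_params(toy_score))
```

Because the selection runs inside each outer training fold, the test fold never influences the chosen parameters.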

All classification systems with different configurations (classification algorithm, feature
selection method, feature subset size) were evaluated on the 8 data sets by repeating 10-fold
cross-validation 10 times. The average MAUC values of the classification systems with the same
feature subset size and classification algorithm were used as the indicator to compare different
feature selection methods. The Wilcoxon signed-rank test at the 95% confidence level was
employed to examine whether the differences between the performance of MDFS and that of the
other feature selection methods are statistically significant.
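
For reference, the MAUC indicator can be computed as the average of pairwise AUCs. The sketch below assumes the Hand and Till formulation and integer class labels that index the columns of the probability matrix; the function names are illustrative, and the quadratic pairwise comparison is chosen for clarity rather than speed.

```python
from itertools import combinations


def auc_i_given_j(scores_i, labels, i, j):
    """A(i|j): probability that a random class-i instance receives a higher
    class-i score than a random class-j instance (ties count as 0.5)."""
    pos = [s for s, y in zip(scores_i, labels) if y == i]
    neg = [s for s, y in zip(scores_i, labels) if y == j]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mauc(prob, labels):
    """Average pairwise AUC over all unordered class pairs.

    prob[k][c] is the estimated probability that instance k belongs to class c;
    labels[k] is the true class of instance k (an integer column index).
    """
    classes = sorted(set(labels))
    pair_aucs = []
    for i, j in combinations(classes, 2):
        a_ij = auc_i_given_j([p[i] for p in prob], labels, i, j)
        a_ji = auc_i_given_j([p[j] for p in prob], labels, j, i)
        pair_aucs.append((a_ij + a_ji) / 2.0)
    return sum(pair_aucs) / len(pair_aucs)


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 2, 2]
    perfect = [[1.0 if c == y else 0.0 for c in range(3)] for y in labels]
    print(mauc(perfect, labels))   # perfectly separated classes
```

In the protocol above, this value would be computed on each test fold and averaged over the 10 repetitions of 10-fold cross-validation.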

5.3. Results 

The results of our experiments are summarized in Tables 2 to 9 and Fig. 4. Tables 2
to 9 present the MAUC value for each configuration, one table per data set. The results of
the pairwise Wilcoxon signed-rank tests between MDFS and the 8 other feature selection methods
are labeled as superscripts on the MAUC values of the corresponding classification systems:
† (‡) means that the corresponding result is statistically worse (better) than the result of MDFS,
and no superscript means that no difference was detected by the Wilcoxon test. The largest MAUC
value of each configuration is in boldface.
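
The labeling rule can be illustrated with a small, self-contained signed-rank test. This is a sketch under stated assumptions: it uses the normal approximation without tie or continuity correction, which is not necessarily the variant used in the experiments, and the function names are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, w_minus, p_value). Zero differences are dropped and tied
    absolute differences share their average rank; no tie or continuity
    correction is applied to the variance.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, w_minus, p_value


def marker(mdfs_mauc, other_mauc, alpha=0.05):
    """Superscript for the other method: dagger if worse than MDFS,
    double dagger if better, empty string if no significant difference."""
    w_plus, w_minus, p = wilcoxon_signed_rank(mdfs_mauc, other_mauc)
    if p >= alpha:
        return ""
    return "†" if w_plus > w_minus else "‡"


if __name__ == "__main__":
    # Made-up paired MAUC values, one pair per cross-validation fold.
    mdfs = [0.90 + 0.001 * k for k in range(20)]
    other = [m - 0.01 - 0.0005 * k for k, m in enumerate(mdfs)]
    print(repr(marker(mdfs, other)))
```

Here `alpha=0.05` corresponds to the 95% confidence level, and the paired samples would be the per-fold MAUC values of MDFS and of the competing method for one configuration.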

For the ISOLET data set, MDFS outperformed all the other 8 compared methods on every
classifier when the feature subset size is larger than 10. For the MNIST data set, except for two
configurations (SVM with feature subset sizes 90 and 100), MDFS always obtained the best results. A
similar situation can be observed on the USPS data set. Hence, the superiority of MDFS is
evident on these three data sets. On the speech recognition data set (i.e., the Phoneme data
set), none of the feature selection methods dominated the others. On the Washington data set,
MDFS outperformed the others when combined with Naive Bayes and 1NN. For the other
configurations, SSU or SCHI performed better. For the Indiana data set, mRMR worked particularly
well with the Naive Bayes classifier, and SSU or SCHI performed better otherwise. For the Synthetic
control chart data set, classifiers working with MDFS achieved the largest MAUC in most
cases. For the Thyroid data set, MDFS performed well with 1NN and SVM. As for Naive Bayes
and C4.5, MDFS was comparable with the other methods.

In general, the classification systems employing MDFS as the feature selection
method led to the largest MAUC value more often than not. In the cases where MDFS was not
the best, it still outperformed some of the compared methods. Specifically, MAUCD was almost always
inferior to MDFS, which is not surprising given the reasons stated in Section 4.2. The three feature
ranking methods, CHI, SU and FSDD, were clearly outperformed by MDFS throughout our
experiments, which also agrees with our analysis in Section 4.1. Despite its appealing performance
in accuracy-oriented classification systems, the mRMR method was dominated by MDFS
in MAUC-oriented classification systems in most cases throughout our experiments. SCHI
and SSU were expected to be the strongest challengers to MDFS. However, the overall results on
these 8 data sets clearly demonstrated the advantage of MDFS.

Fig. 4 summarizes the results of the Wilcoxon signed-rank tests conducted between MDFS
and the other 8 compared methods. There are 4 sub-graphs, corresponding to the comparisons
on the 4 different types of classifiers. The height of each bar is the number of times that MDFS wins
(draws or loses) against the counterpart feature selection method on the corresponding classifier over all
feature subset sizes and data sets. It can be clearly seen that MDFS performed significantly better
than all the compared methods on all 4 classifiers.

Figure 4: Results of the Wilcoxon signed-rank test with 95% confidence level between MDFS and the 8 other compared
feature selection methods on 4 different types of classifiers. The height of each bar indicates the number of times that MDFS wins
(draws or loses) against the counterpart feature selection method over all feature subset sizes and data sets.
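
The win/draw/lose counts behind Fig. 4 can be sketched as a simple tally over the table superscripts. The marker list below is made up for illustration, and the count of comparisons assumes 10 subset sizes for seven data sets and 6 for the Synthetic data set, as described in the experimental setup.

```python
from collections import Counter

# Outcome of one Wilcoxon comparison, read from MDFS's point of view:
# a dagger on the other method means MDFS wins, a double dagger means it loses.
OUTCOME = {"†": "win", "": "draw", "‡": "lose"}


def tally(markers):
    """Count MDFS wins, draws and losses from a list of table superscripts."""
    counts = Counter(OUTCOME[m] for m in markers)
    return {name: counts.get(name, 0) for name in ("win", "draw", "lose")}


if __name__ == "__main__":
    # Comparisons per competitor and classifier: 7 data sets with 10 subset
    # sizes plus the 60-feature Synthetic data set with 6 sizes.
    sizes_per_dataset = [10] * 7 + [6]
    print(sum(sizes_per_dataset))

    # Illustrative markers for one competitor on one classifier and data set.
    example = ["†"] * 7 + [""] * 2 + ["‡"]
    print(tally(example))
```

Each bar group in a sub-graph would then be the result of `tally` applied to all markers of one competing method on one classifier.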

Finally, Table 10 shows the runtime of every feature selection method on the 8 data sets. As 
analyzed in Section 4.4, the computational complexity of MDFS is comparable with that of most 
of the compared feature selection methods (except the FSDD method and the mRMR method). 
However, due to the constant factor (e.g., number of classes c) in the complexity analysis and 
implementation details of these algorithms, the actual runtime may deviate from the complexity 
analysis and is only indicative. 



6. Conclusions and discussions 

Although numerous successful feature selection methods have been developed for accuracy- 
oriented classification systems, recent studies revealed that accuracy itself is not an appropriate 
performance metric in many real-world practices and may lead to undesirable classification sys- 
tems. Instead, AUC and MAUC are adopted more and more commonly in the literature. This 
shift of performance metric raises the need for new feature selection methods. In this study, we 
proposed the MDFS feature selection method for MAUC-oriented classification systems. It was 
inspired by the observation that MAUC value of a classification system is actually the average of 
Table 2: Average MAUC obtained with the nine compared methods on the ISOLET data set. The results were obtained
by repeating the 10-fold cross-validation procedure 10 times. For each classifier and feature subset size, the Wilcoxon
signed-rank test with 95% confidence level is employed to compare MDFS with the 8 other methods. The methods that
performed significantly worse (better) than MDFS are marked with † (‡). No superscript is used if no statistically
significant difference is detected. The largest MAUC value of each configuration is in boldface.

                      Feature Subset Size
Classifier   Method   10      20      30      40      50      60      70      80      90      100

Naive Bayes  MDFS     0.958   0.977   0.982   0.984   0.985   0.985   0.986   0.986   0.986   0.986
             MAUCD    0.906†  0.924†  0.917†  0.930†  0.941†  0.948†  0.952†  0.962†  0.973†  0.977†
             mRMR     0.962‡  0.973†  0.977†  0.979†  0.981†  0.982†  0.983†  0.984†  0.984†  0.985†
             FSDD     0.941†  0.955†  0.961†  0.962†  0.966†  0.972†  0.974†  0.977†  0.980†  0.980†
             SCHI     0.961‡  0.976   0.978†  0.978†  0.978†  0.978†  0.978†  0.978†  0.978†  0.979†
             SSU      0.954†  0.974†  0.975†  0.977†  0.976†  0.975†  0.976†  0.975†  0.976†  0.975†
             CHI      0.945†  0.954†  0.969†  0.972†  0.972†  0.973†  0.974†  0.976†  0.975†  0.975†
             SU       0.904†  0.928†  0.946†  0.957†  0.961†  0.971†  0.974†  0.976†  0.978†  0.978†
             ReliefF  0.911†  0.943†  0.956†  0.963†  0.969†  0.973†  0.976†  0.979†  0.980†  0.982†

C4.5         MDFS     0.836   0.886   0.904   0.911   0.917   0.921   0.923   0.924   0.927   0.928
             MAUCD    0.727†  0.761†  0.768†  0.779†  0.802†  0.811†  0.813†  0.840†  0.872†  0.876†
             mRMR     0.840   0.869†  0.880†  0.884†  0.890†  0.896†  0.899†  0.902†  0.902†  0.904†
             FSDD     0.804†  0.842†  0.867†  0.870†  0.887†  0.900†  0.901†  0.903†  0.909†  0.910†
             SCHI     0.862‡  0.879†  0.896†  0.900†  0.905†  0.909†  0.915†  0.918†  0.922†  0.922†
             SSU      0.869‡  0.885   0.896†  0.899†  0.904†  0.905†  0.910†  0.911†  0.915†  0.917†
             CHI      0.849‡  0.853†  0.881†  0.898†  0.906†  0.909†  0.912†  0.915†  0.916†  0.917†
             SU       0.793†  0.816†  0.840†  0.856†  0.861†  0.894†  0.899†  0.905†  0.909†  0.909†
             ReliefF  0.764†  0.819†  0.848†  0.865†  0.878†  0.886†  0.892†  0.899†  0.903†  0.907†

1NN          MDFS     0.780   0.869   0.903   0.919   0.930   0.937   0.943   0.947   0.951   0.954
             MAUCD    0.674†  0.732†  0.746†  0.767†  0.791†  0.802†  0.813†  0.843†  0.869†  0.878†
             mRMR     0.774†  0.852†  0.879†  0.889†  0.899†  0.909†  0.911†  0.915†  0.915†  0.917†
             FSDD     0.710†  0.795†  0.827†  0.838†  0.864†  0.889†  0.891†  0.902†  0.910†  0.911†
             SCHI     0.788‡  0.852†  0.881†  0.894†  0.906†  0.912†  0.919†  0.927†  0.930†  0.934†
             SSU      0.780   0.849†  0.873†  0.885†  0.895†  0.902†  0.910†  0.914†  0.922†  0.926†
             CHI      0.745†  0.795†  0.862†  0.886†  0.895†  0.906†  0.915†  0.921†  0.925†  0.926†
             SU       0.690†  0.761†  0.800†  0.828†  0.842†  0.880†  0.890†  0.902†  0.913†  0.916†
             ReliefF  0.687†  0.767†  0.810†  0.835†  0.857†  0.869†  0.879†  0.888†  0.895†  0.902†

SVM          MDFS     0.808   0.892   0.927   0.944   0.953   0.960   0.965   0.969   0.972   0.974
             MAUCD    0.707†  0.763†  0.773†  0.801†  0.834†  0.845†  0.852†  0.882†  0.911†  0.919†
             mRMR     0.809   0.871†  0.901†  0.914†  0.929†  0.939†  0.942†  0.947†  0.949†  0.952†
             FSDD     0.750†  0.826†  0.859†  0.871†  0.891†  0.916†  0.919†  0.930†  0.939†  0.941†
             SCHI     0.817‡  0.883†  0.913†  0.927†  0.941†  0.948†  0.954†  0.960†  0.963†  0.966†
             SSU      0.826‡  0.880†  0.905†  0.922†  0.933†  0.940†  0.949†  0.953†  0.958†  0.961†
             CHI      0.780†  0.828†  0.891†  0.914†  0.924†  0.933†  0.940†  0.947†  0.952†  0.953†
             SU       0.731†  0.796†  0.832†  0.863†  0.877†  0.910†  0.922†  0.934†  0.942†  0.946†
             ReliefF  0.721†  0.798†  0.841†  0.869†  0.890†  0.904†  0.917†  0.928†  0.935†  0.942†
Table 3: Average MAUC obtained with the nine compared methods on the MNIST data set. The results were obtained
by repeating the 10-fold cross-validation procedure 10 times. For each classifier and feature subset size, the Wilcoxon
signed-rank test with 95% confidence level is employed to compare MDFS with the 8 other methods. The methods that
performed significantly worse (better) than MDFS are marked with † (‡). No superscript is used if no statistically
significant difference is detected. The largest MAUC value of each configuration is in boldface.

                      Feature Subset Size
Classifier   Method   10      20      30      40      50      60      70      80      90      100

Naive Bayes  MDFS     0.889   0.929   0.944   0.952   0.956   0.961   0.963   0.965   0.967   0.968
             MAUCD    0.828†  0.889†  0.922†  0.936†  0.944†  0.950†  0.955†  0.957†  0.958†  0.960†
             mRMR     0.877†  0.914†  0.933†  0.939†  0.945†  0.950†  0.952†  0.954†  0.954†  0.956†
             FSDD     0.500†  0.500†  0.500†  0.720†  0.732†  0.754†  0.793†  0.806†  0.802†  0.795†
             SCHI     0.803†  0.842†  0.871†  0.886†  0.906†  0.914†  0.919†  0.923†  0.927†  0.931†
             SSU      0.826†  0.870†  0.888†  0.901†  0.915†  0.922†  0.926†  0.929†  0.932†  0.935†
             CHI      0.834†  0.879†  0.914†  0.928†  0.933†  0.939†  0.941†  0.941†  0.944†  0.947†
             SU       0.844†  0.891†  0.918†  0.932†  0.939†  0.941†  0.942†  0.946†  0.949†  0.950†
             ReliefF  0.843†  0.892†  0.917†  0.933†  0.943†  0.950†  0.955†  0.959†  0.962†  0.964†

C4.5         MDFS     0.829   0.865   0.878   0.884   0.888   0.893   0.894   0.898   0.900   0.901
             MAUCD    0.766†  0.836†  0.863†  0.872†  0.877†  0.880†  0.886†  0.889†  0.889†  0.891†
             mRMR     0.809†  0.851†  0.866†  0.876†  0.885†  0.890†  0.893   0.895†  0.897†  0.896†
             FSDD     0.500†  0.500†  0.500†  0.697†  0.725†  0.740†  0.764†  0.769†  0.768†  0.768†
             SCHI     0.818†  0.839†  0.852†  0.860†  0.871†  0.878†  0.880†  0.882†  0.888†  0.890†
             SSU      0.807†  0.838†  0.853†  0.872†  0.883†  0.889†  0.890†  0.893†  0.894†  0.897†
             CHI      0.784†  0.824†  0.852†  0.869†  0.878†  0.883†  0.884†  0.884†  0.888†  0.892†
             SU       0.804†  0.838†  0.864†  0.872†  0.879†  0.883†  0.885†  0.888†  0.890†  0.892†
             ReliefF  0.796†  0.837†  0.859†  0.871†  0.881†  0.885†  0.889†  0.892†  0.895†  0.896†

1NN          MDFS     0.764   0.855   0.892   0.912   0.927   0.938   0.947   0.952   0.956   0.959
             MAUCD    0.712†  0.809†  0.860†  0.886†  0.897†  0.910†  0.922†  0.925†  0.930†  0.935†
             mRMR     0.761†  0.842†  0.879†  0.904†  0.920†  0.930†  0.936†  0.941†  0.945†  0.949†
             FSDD     0.500†  0.500†  0.500†  0.595†  0.646†  0.678†  0.714†  0.726†  0.728†  0.728†
             SCHI     0.679†  0.762†  0.814†  0.843†  0.872†  0.889†  0.901†  0.911†  0.919†  0.928†
             SSU      0.674†  0.749†  0.792†  0.834†  0.872†  0.895†  0.906†  0.914†  0.924†  0.931†
             CHI      0.725†  0.800†  0.853†  0.879†  0.898†  0.913†  0.919†  0.923†  0.927†  0.934†
             SU       0.737†  0.810†  0.852†  0.880†  0.899†  0.911†  0.916†  0.923†  0.927†  0.933†
             ReliefF  0.727†  0.823†  0.873†  0.902†  0.919†  0.931†  0.939†  0.947†  0.952†  0.956†

SVM          MDFS     0.786   0.878   0.914   0.930   0.939   0.945   0.948   0.951   0.952   0.953
             MAUCD    0.738†  0.836†  0.883†  0.908†  0.918†  0.925†  0.935†  0.938†  0.941†  0.944†
             mRMR     0.778†  0.868†  0.904†  0.924†  0.936†  0.944   0.948   0.951   0.953   0.955‡
             FSDD     0.500†  0.500†  0.500†  0.648†  0.693†  0.715†  0.746†  0.757†  0.759†  0.760†
             SCHI     0.737†  0.796†  0.843†  0.873†  0.907†  0.923†  0.933†  0.941†  0.948†  0.953
             SSU      0.725†  0.788†  0.824†  0.866†  0.904†  0.927†  0.937†  0.944†  0.950†  0.955‡
             CHI      0.755†  0.826†  0.882†  0.906†  0.919†  0.931†  0.934†  0.937†  0.940†  0.944†
             SU       0.763†  0.836†  0.883†  0.909†  0.923†  0.932†  0.935†  0.940†  0.942†  0.946†
             ReliefF  0.752†  0.850†  0.900†  0.922†  0.933†  0.940†  0.944†  0.947†  0.949†  0.950†
Table 4: Average MAUC obtained with the nine compared methods on the USPS data set. The results were obtained
by repeating the 10-fold cross-validation procedure 10 times. For each classifier and feature subset size, the Wilcoxon
signed-rank test with 95% confidence level is employed to compare MDFS with the 8 other methods. The methods that
performed significantly worse (better) than MDFS are marked with † (‡). No superscript is used if no statistically
significant difference is detected. The largest MAUC value of each configuration is in boldface.

                      Feature Subset Size
Classifier   Method   10      20      30      40      50      60      70      80      90      100

Naive Bayes  MDFS     0.929   0.962   0.972   0.976   0.978   0.979   0.980   0.981   0.981   0.981
             MAUCD    0.912†  0.928†  0.948†  0.964†  0.966†  0.971†  0.971†  0.974†  0.975†  0.977†
             mRMR     0.921†  0.947†  0.960†  0.965†  0.967†  0.973†  0.975†  0.976†  0.977†  0.978†
             FSDD     0.914†  0.929†  0.944†  0.951†  0.967†  0.968†  0.969†  0.972†  0.974†  0.975†
             SCHI     0.907†  0.953†  0.968†  0.970†  0.971†  0.972†  0.972†  0.972†  0.972†  0.971†
             SSU      0.924†  0.945†  0.955†  0.961†  0.962†  0.963†  0.965†  0.965†  0.965†  0.965†
             CHI      0.896†  0.923†  0.950†  0.963†  0.966†  0.967†  0.969†  0.973†  0.974†  0.976†
             SU       0.817†  0.926†  0.937†  0.943†  0.951†  0.956†  0.963†  0.967†  0.969†  0.970†
             ReliefF  0.898†  0.938†  0.954†  0.963†  0.968†  0.971†  0.973†  0.975†  0.977†  0.978†

C4.5         MDFS     0.879   0.908   0.920   0.923   0.927   0.928   0.930   0.932   0.933   0.933
             MAUCD    0.862†  0.873†  0.891†  0.915†  0.917†  0.922†  0.929   0.932   0.931†  0.932
             mRMR     0.859†  0.893†  0.895†  0.904†  0.911†  0.921†  0.928†  0.930†  0.931†  0.933
             FSDD     0.862†  0.874†  0.887†  0.903†  0.923†  0.926   0.928†  0.932   0.931   0.933
             SCHI     0.860†  0.901†  0.911†  0.918†  0.925   0.925†  0.925†  0.925†  0.925†  0.928†
             SSU      0.886‡  0.895†  0.904†  0.910†  0.915†  0.918†  0.923†  0.926†  0.927†  0.930†
             CHI      0.853†  0.873†  0.898†  0.918†  0.924†  0.924†  0.928†  0.930†  0.930†  0.931†
             SU       0.763†  0.892†  0.899†  0.902†  0.909†  0.914†  0.922†  0.926†  0.927†  0.928†
             ReliefF  0.853†  0.883†  0.896†  0.907†  0.912†  0.917†  0.921†  0.925†  0.926†  0.928†

1NN          MDFS     0.813   0.905   0.939   0.954   0.963   0.968   0.972   0.975   0.976   0.977
             MAUCD    0.790†  0.844†  0.885†  0.926†  0.937†  0.950†  0.957†  0.965†  0.969†  0.973†
             mRMR     0.796†  0.884†  0.915†  0.935†  0.945†  0.957†  0.965†  0.970†  0.972†  0.975†
             FSDD     0.787†  0.846†  0.875†  0.913†  0.947†  0.955†  0.961†  0.968†  0.972†  0.974†
             SCHI     0.782†  0.890†  0.931†  0.947†  0.954†  0.959†  0.961†  0.965†  0.967†  0.968†
             SSU      0.807†  0.872†  0.912†  0.935†  0.945†  0.953†  0.961†  0.966†  0.969†  0.972†
             CHI      0.761†  0.832†  0.888†  0.932†  0.944†  0.951†  0.959†  0.966†  0.969†  0.972†
             SU       0.656†  0.841†  0.876†  0.895†  0.916†  0.928†  0.949†  0.958†  0.963†  0.967†
             ReliefF  0.786†  0.874†  0.913†  0.933†  0.946†  0.955†  0.961†  0.966†  0.970†  0.972†

SVM          MDFS     0.829   0.924   0.957   0.968   0.975   0.978   0.980   0.982   0.984   0.985
             MAUCD    0.817†  0.868†  0.913†  0.948†  0.958†  0.967†  0.972†  0.979†  0.982†  0.984†
             mRMR     0.820†  0.904†  0.939†  0.958†  0.966†  0.973†  0.979†  0.981†  0.982†  0.984†
             FSDD     0.817†  0.866†  0.904†  0.938†  0.965†  0.971†  0.975†  0.980†  0.982†  0.983†
             SCHI     0.809†  0.916†  0.952†  0.965†  0.969†  0.973†  0.977†  0.978†  0.980†  0.981†
             SSU      0.835‡  0.902†  0.936†  0.955†  0.964†  0.971†  0.974†  0.978†  0.980†  0.982†
             CHI      0.793†  0.860†  0.912†  0.953†  0.963†  0.969†  0.974†  0.979†  0.981†  0.983†
             SU       0.681†  0.877†  0.907†  0.923†  0.939†  0.950†  0.968†  0.974†  0.976†  0.979†
             ReliefF  0.810†  0.893†  0.933†  0.951†  0.963†  0.970†  0.975†  0.979†  0.981†  0.983†
Table 5: Average MAUC obtained with the nine compared methods on the Phoneme data set. The results were obtained
by repeating the 10-fold cross-validation procedure 10 times. For each classifier and feature subset size, the Wilcoxon
signed-rank test with 95% confidence level is employed to compare MDFS with the 8 other methods. The methods that
performed significantly worse (better) than MDFS are marked with † (‡). No superscript is used if no statistically
significant difference is detected. The largest MAUC value of each configuration is in boldface.

                      Feature Subset Size
Classifier   Method   10      20      30      40      50      60      70      80      90      100

Naive Bayes  MDFS     0.974   0.980   0.982   0.983   0.983   0.983   0.983   0.983   0.983   0.983
             MAUCD    0.956†  0.959†  0.977†  0.981†  0.980†  0.980†  0.979†  0.978†  0.977†  0.976†
             mRMR     0.977‡  0.979†  0.979†  0.980†  0.981†  0.981†  0.981†  0.981†  0.981†  0.981†
             FSDD     0.937†  0.954†  0.977†  0.982†  0.983   0.982†  0.981†  0.981†  0.980†  0.979†
             SCHI     [illegible]
             SSU      [illegible]
             CHI      [illegible]
             SU       [illegible]
             ReliefF  [illegible]

C4.5         MDFS     0.945   0.947   0.944   0.939   0.938   0.936   0.933   0.932   0.931   0.930
             MAUCD    0.925†  0.921†  0.938†  0.937   0.937   0.932†  0.931†  0.931   0.928   0.928
             mRMR     0.950‡  0.946   0.939†  0.937   0.933†  0.930†  0.932   0.933   0.932   0.930
             FSDD     [illegible]
             SCHI     0.949‡  0.943†  0.939†  0.935†  0.933†  0.933   0.933   0.931   0.931   0.929
             SSU      0.945   0.944†  0.937†  0.936†  0.931†  0.930†  0.932   0.929†  0.929   0.928
             CHI      0.913†  0.914†  0.925†  0.935†  0.932†  0.929†  0.928†  0.927†  0.925†  0.923†
             SU       0.901†  0.910†  0.939†  0.937†  0.936   0.933   0.930†  0.927†  0.927†  0.929
             ReliefF  0.923†  0.925†  0.935†  0.941   0.938   0.937   0.937‡  0.934   0.934   0.932

1NN          MDFS     0.903   0.923   0.926   0.927   0.928   0.929   0.929   0.928   0.928   0.928
             MAUCD    0.865†  0.892†  0.917†  0.926   0.922†  0.921†  0.920†  0.920†  0.920†  0.918†
             mRMR     0.905   0.919†  0.921†  0.927   0.926†  0.927   0.929   0.931‡  0.933‡  0.932‡
             FSDD     0.849†  0.882†  0.917†  0.931‡  0.933‡  0.932‡  0.929   0.925†  0.925†  0.926
             SCHI     0.906   0.920†  0.922†  0.923†  0.926   0.928   0.929   0.931‡  0.930   0.931‡
             SSU      0.898†  0.918†  0.921†  0.925   0.925†  0.926†  0.928   0.930   0.931‡  0.931‡
             CHI      0.858†  0.884†  0.890†  0.925   0.923†  0.923†  0.920†  0.917†  0.915†  0.913†
             SU       0.843†  0.882†  0.919†  0.930‡  0.928   0.926†  0.923†  0.921†  0.920†  0.919†
             ReliefF  0.872†  0.901†  0.919†  0.929‡  0.931‡  0.932‡  0.932‡  0.931‡  0.931‡  0.930‡

SVM          MDFS     0.917   0.939   0.943   0.946   0.948   0.949   0.950   0.950   0.950   0.951
             MAUCD    0.867†  0.904†  0.933†  0.938†  0.940†  0.943†  0.943†  0.944†  0.944†  0.945†
             mRMR     0.925‡  0.935†  0.937†  0.939†  0.941†  0.943†  0.945†  0.947†  0.949   0.951
             FSDD     0.820†  0.890†  0.933†  0.947   0.949   0.949   0.949   0.949   0.949   0.949
             SCHI     0.926‡  0.935†  0.939†  0.942†  0.945†  0.947†  0.948†  0.949†  0.950   0.950
             SSU      0.919   0.934†  0.938†  0.942†  0.944†  0.945†  0.948†  0.949†  0.950   0.951
             CHI      0.860†  0.894†  0.918†  0.939†  0.940†  0.942†  0.942†  0.942†  0.943†  0.943†
             SU       0.830†  0.892†  0.935†  0.943†  0.945†  0.945†  0.945†  0.946†  0.946†  0.948†
             ReliefF  0.868†  0.911†  0.938†  0.946   0.949   0.950‡  0.951‡  0.951   0.951   0.951
Table 6: Average MAUC obtained with the nine compared methods on the Washington data set. The results were
obtained by repeating the 10-fold cross-validation procedure 10 times. For each classifier and feature subset size, the
Wilcoxon signed-rank test with 95% confidence level is employed to compare MDFS with the 8 other methods. The methods
that performed significantly worse (better) than MDFS are marked with † (‡). No superscript is used if no statistically
significant difference is detected. The largest MAUC value of each configuration is in boldface.

                      Feature Subset Size
Classifier   Method   10      20      30      40      50      60      70      80      90      100

Naive Bayes  MDFS     0.970   0.973   0.973   0.972   0.972   0.971   0.971   0.970   0.970   0.970
             MAUCD    0.934†  0.930†  0.927†  0.926†  0.926†  0.938†  0.941†  0.940†  0.943†  0.949†
             mRMR     0.968†  0.970†  0.971†  0.970†  0.970†  0.969†  0.969†  0.969†  0.969†  0.969†
             FSDD     0.918†  0.934†  0.939†  0.940†  0.939†  0.945†  0.948†  0.947†  0.946†  0.947†
             SCHI     0.968†  0.971†  0.971†  0.970†  0.969†  0.968†  0.968†  0.968†  0.968†  0.968†
             SSU      0.969†  0.971†  0.971†  0.971†  0.971†  0.971†  0.970†  0.970†  0.970†  0.969†
             CHI      0.946†  0.945†  0.944†  0.942†  0.942†  0.940†  0.950†  0.953†  0.955†  0.956†
             SU       0.944†  0.947†  0.948†  0.949†  0.951†  0.953†  0.955†  0.956†  0.957†  0.957†
             ReliefF  0.899†  0.910†  0.920†  0.927†  0.936†  0.941†  0.947†  0.956†  0.960†  0.962†

C4.5         MDFS     0.961   0.955   0.952   0.949   0.947   0.947   0.946   0.944   0.943   0.942
             MAUCD    0.948†  0.948†  0.941†  0.935†  0.930†  0.933†  0.934†  0.932†  0.932†  0.930†
             mRMR     0.960   0.952†  0.951   0.949   0.947   0.945†  0.945   0.944   0.943   0.941
             FSDD     0.922†  0.930†  0.923†  0.919†  0.916†  0.932†  0.934†  0.932†  0.930†  0.931†
             SCHI     0.964‡  0.957‡  0.954   0.950   0.949   0.948   0.947   0.944   0.942   0.941
             SSU      0.958†  0.953†  0.949†  0.947†  0.946   0.946   0.945   0.944   0.942   0.942
             CHI      0.948†  0.940†  0.933†  0.929†  0.930†  0.927†  0.942†  0.940†  0.939†  0.939†
             SU       0.951†  0.948†  0.946†  0.946†  0.945†  0.943†  0.942†  0.941†  0.942†  0.942
             ReliefF  0.918†  0.927†  0.933†  0.937†  0.940†  0.940†  0.941†  0.943   0.943   0.943‡

1NN          MDFS     0.924   0.930   0.931   0.932   0.932   0.932   0.932   0.932   0.933   0.933
             MAUCD    0.894†  0.901†  0.903†  0.902†  0.905†  0.912†  0.912†  0.914†  0.917†  0.920†
             mRMR     0.927‡  0.928   0.929†  0.929†  0.929†  0.929†  0.929†  0.930†  0.930†  0.930†
             FSDD     0.855†  0.883†  0.887†  0.890†  0.895†  0.911†  0.913†  0.913†  0.912†  0.916†
             SCHI     0.920†  0.924†  0.924†  0.925†  0.926†  0.927†  0.927†  0.929†  0.931†  0.931†
             SSU      0.914†  0.919†  0.922†  0.924†  0.927†  0.929†  0.930†  0.932   0.932   0.933
             CHI      0.891†  0.897†  0.902†  0.903†  0.907†  0.910†  0.922†  0.921†  0.923†  0.924†
             SU       0.905†  0.914†  0.916†  0.919†  0.921†  0.921†  0.923†  0.923†  0.924†  0.925†
             ReliefF  0.865†  0.895†  0.906†  0.913†  0.918†  0.920†  0.922†  0.925†  0.927†  0.927†

SVM          MDFS     0.935   0.941   0.941   0.942   0.942   0.942   0.942   0.942   0.942   0.943
             MAUCD    0.919†  0.924†  0.928†  0.928†  0.929†  0.935†  0.935†  0.935†  0.939†  0.942†
             mRMR     0.936   0.939†  0.939†  0.940   0.940†  0.941†  0.941   0.942   0.943   0.943
             FSDD     0.856†  0.905†  0.907†  0.909†  0.911†  0.933†  0.935†  0.935†  0.935†  0.937†
             SCHI     0.936   0.938†  0.938†  0.939†  0.940†  0.940†  0.941†  0.943‡  0.945‡  0.945‡
             SSU      0.931†  0.937†  0.939†  0.942   0.944‡  0.944‡  0.944‡  0.944‡  0.944‡  0.945‡
             CHI      0.910†  0.915†  0.919†  0.922†  0.930†  0.931†  0.937†  0.939†  0.939†  0.941†
             SU       0.927†  0.933†  0.935†  0.937†  0.938†  0.939†  0.940†  0.941†  0.942   0.943
             ReliefF  0.905†  0.922†  0.928†  0.932†  0.936†  0.938†  0.939†  0.941†  0.941†  0.942†
Table 7: Average MAUC obtained with the nine compared methods on the Indiana data set. The results were obtained
by repeating the 10-fold cross-validation procedure 10 times. For each classifier and feature subset size, the Wilcoxon
signed-rank test with 95% confidence level is employed to compare MDFS with the 8 other methods. The methods that
performed significantly worse (better) than MDFS are marked with † (‡). No superscript is used if no statistically
significant difference is detected. The largest MAUC value of each configuration is in boldface.

                      Feature Subset Size
Classifier   Method   10      20      30      40      50      60      70      80      90      100

Naive Bayes  MDFS     0.890   0.892   0.893   0.891   0.890   0.890   0.890   0.890   0.890   0.890
             MAUCD    0.869†  0.874†  0.873†  0.872†  0.872†  0.872†  0.873†  0.874†  0.875†  0.877†
             mRMR     0.894‡  0.894‡  0.896‡  0.896‡  0.897‡  0.895‡  0.894‡  0.894‡  0.893‡  0.893‡
             FSDD     0.867†  0.868†  0.865†  0.865†  0.866†  0.869†  0.871†  0.873†  0.875†  0.877†
             SCHI     0.888†  0.888†  0.888†  0.886†  0.886†  0.885†  0.885†  0.885†  0.884†  0.884†
             SSU      0.887†  0.886†  0.885†  0.884†  0.884†  0.883†  0.883†  0.883†  0.883†  0.883†
             CHI      0.867†  0.866†  0.864†  0.868†  0.872†  0.874†  0.876†  0.877†  0.878†  0.878†
             SU       0.868†  0.867†  0.865†  0.866†  0.870†  0.872†  0.874†  0.876†  0.877†  0.878†
             ReliefF  0.874†  0.882†  0.883†  0.882†  0.882†  0.882†  0.881†  0.881†  0.881†  0.882†

C4.5         MDFS     0.882   0.889   0.892   0.893   0.893   0.893   0.891   0.893   0.892   0.891
             MAUCD    0.797†  0.847†  0.846†  0.847†  0.844†  0.842†  0.852†  0.851†  0.868†  0.876†
             mRMR     0.866†  0.870†  0.876†  0.882†  0.882†  0.886†  0.888†  0.888†  0.888†  0.889
             FSDD     0.804†  0.805†  0.808†  0.817†  0.840†  0.850†  0.851†  0.849†  0.847†  0.868†
             SCHI     0.904‡  0.904‡  0.902‡  0.902‡  0.900‡  0.899‡  0.900‡  0.899‡  0.900‡  0.899‡
             SSU      0.895‡  0.901‡  0.902‡  0.902‡  0.902‡  0.900‡  0.900‡  0.899‡  0.899‡  0.898‡
             CHI      0.797†  0.811†  0.823†  0.854†  0.870†  0.877†  0.879†  0.881†  0.881†  0.881†
             SU       0.810†  0.808†  0.819†  0.845†  0.855†  0.859†  0.874†  0.875†  0.877†  0.878†
             ReliefF  0.860†  0.874†  0.884†  0.887†  0.889†  0.889†  0.889†  0.890   0.890   0.891

1NN          MDFS     0.864   0.896   0.909   0.917   0.919   0.921   0.923   0.924   0.923   0.922
             MAUCD    0.759†  0.852†  0.876†  0.875†  0.874†  0.855†  0.868†  0.872†  0.883†  0.904†
             mRMR     0.815†  0.826†  0.841†  0.863†  0.875†  0.882†  0.887†  0.894†  0.896†  0.898†
             FSDD     0.752†  0.796†  0.819†  0.841†  0.860†  0.870†  0.872†  0.869†  0.854†  0.886†
             SCHI     0.877‡  0.891†  0.905†  0.911†  0.915   0.915†  0.916†  0.918†  0.916†  0.915†
             SSU      0.897‡  0.922‡  0.931‡  0.935‡  0.938‡  0.939‡  0.940‡  0.940‡  0.940‡  0.940‡
             CHI      0.762†  0.820†  0.847†  0.898†  0.915†  0.924‡  0.927‡  0.927‡  0.929‡  0.928‡
             SU       0.767†  0.798†  0.827†  0.863†  0.877†  0.886†  0.907†  0.912†  0.914†  0.917†
             ReliefF  0.866   0.897   0.913‡  0.919   0.922‡  0.925‡  0.925‡  0.926‡  0.927‡  0.927‡

SVM          MDFS     0.890   0.927   0.941   0.947   0.950   0.952   0.954   0.955   0.957   0.957
             MAUCD    0.774†  0.867†  0.882†  0.887†  0.891†  0.894†  0.918†  0.921†  0.948†  0.953†
             mRMR     0.888   0.915†  0.931†  0.938†  0.942†  0.946†  0.950†  0.953†  0.955†  0.957
             FSDD     0.786†  0.822†  0.836†  0.863†  0.894†  0.910†  0.916†  0.920†  0.921†  0.952†
             SCHI     0.915‡  0.936‡  0.943‡  0.946   0.948†  0.949†  0.953   0.955   0.956   0.958
             SSU      0.927‡  0.949‡  0.954‡  0.957‡  0.959‡  0.961‡  0.961‡  0.962‡  0.963‡  0.964‡
             CHI      0.778†  0.833†  0.864†  0.906†  0.937†  0.945†  0.950†  0.953†  0.952†  0.953†
             SU       0.784†  0.818†  0.848†  0.887†  0.912†  0.926†  0.947†  0.950†  0.952†  0.954†




ReliefF 


0.882* 


0.911* 


0.930* 


0.940* 


0.946* 


0.950* 


0.953 


0.955 


0.956 


0.957 
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Table 8: Average MAUC obtained with the nine compared methods on the Synthetic data set. The results were obtained 
by repeating the 10-fold cross-validation procedure 10 times. For each classifier and feature subset size, the Wilcoxon 
signed-rank test at the 95% confidence level is employed to compare MDFS with the 8 other methods. The methods that 
performed significantly worse (better) than MDFS are marked with †(‡). No superscript is used if no statistically 
significant difference is detected. The largest MAUC value of each configuration is in boldface. 



                      Feature Subset Size
Classifier   Method   10           20           30           40           50           60
Naive Bayes  MDFS     0.992        0.996        0.997        0.998        0.998        0.998
             MAUCD    [illegible]  [illegible]  [illegible]  0.993†       [illegible]  [illegible]
             mRMR     [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
             FSDD     [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
             SCHI     [illegible]  [illegible]  0.990†       [illegible]  [illegible]  [illegible]
             SSU      0.975†       0.979†       0.985†       0.994†       [illegible]  0.998
             CHI      [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
             SU       [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
             ReliefF  [illegible]  0.970        0.981†       0.993†       0.997†       0.998




C4.5         MDFS     0.950        0.959        0.963        0.970        0.968        0.966
             MAUCD    0.929†       0.909†       0.927†       0.949†       [illegible]  0.966
             mRMR     0.964‡       0.966‡       0.972‡       0.972        0.969        0.966
             FSDD     0.928†       0.905†       0.926†       0.952†       0.961†       0.966
             SCHI     0.943        0.949†       0.954†       0.966        0.966†       0.966
             SSU      0.925†       0.932†       0.954†       0.966        0.967        0.966
             CHI      0.933†       0.910†       0.937†       0.966        0.972‡       0.966
             SU       0.928†       0.907†       0.925†       0.954†       0.965†       0.966
             ReliefF  0.921†       0.916†       0.932†       0.954†       0.965†       0.966

1NN          MDFS     0.929        0.965        0.979        0.987        0.989        0.982
             MAUCD    0.880†       0.919†       0.956†       0.983†       0.991        0.982
             mRMR     0.942‡       0.956†       0.977        0.984†       0.987†       0.982
             FSDD     0.887†       0.910†       0.941†       0.967†       0.984†       0.982
             SCHI     0.896†       0.952†       0.970†       0.981†       0.986†       0.982
             SSU      0.887†       0.923†       0.950†       0.973†       0.986†       0.982
             CHI      0.876†       0.917†       0.953†       0.985        0.988        0.982
             SU       0.891†       0.914†       0.946†       0.978†       0.985†       0.982
             ReliefF  0.885†       0.920†       0.955†       0.977†       0.985†       0.982

SVM          MDFS     0.946        0.984        0.994        0.996        0.997        0.999
             MAUCD    0.880†       0.945†       0.979†       0.995        0.998‡       0.999
             mRMR     0.947        0.980†       0.994        0.996        0.997        0.999
             FSDD     0.881†       0.942†       0.972†       0.993†       0.998        0.999
             SCHI     0.919†       0.969†       0.990†       0.995†       0.997        0.999
             SSU      0.898†       0.959†       0.973†       0.994†       0.998‡       0.999
             CHI      0.894†       0.952†       0.977†       0.997        0.998‡       0.999
             SU       0.934†       0.956†       0.979†       0.993†       0.999‡       0.999
             ReliefF  0.925†       0.959†       0.982†       0.994†       0.997        0.999
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Table 9: Average MAUC obtained with the nine compared methods on the Thyroid data set. The results were obtained 
by repeating the 10-fold cross-validation procedure 10 times. For each classifier and feature subset size, the Wilcoxon 
signed-rank test at the 95% confidence level is employed to compare MDFS with the 8 other methods. The methods that 
performed significantly worse (better) than MDFS are marked with †(‡). No superscript is used if no statistically 
significant difference is detected. The largest MAUC value of each configuration is in boldface. 



                      Feature Subset Size
Classifier   Method   10      20      30      40      50      60      70      80      90      100
Naive Bayes  MDFS     0.857   0.889   0.905   0.911   0.911   0.910   0.906   0.904   0.902   0.901
             MAUCD    0.879‡  0.890   0.895†  0.900†  0.906†  0.906†  0.906   0.905   0.903   0.900
             mRMR     0.882‡  0.893   0.899†  0.903†  0.908   0.907   0.906   0.906   0.902   0.903
             FSDD     0.871‡  0.892   0.909   0.910   0.909   0.911   0.911   0.913‡  0.914‡  0.913‡
             SCHI     0.872‡  0.881†  0.885†  0.888†  0.891†  0.891†  0.894†  0.898†  0.900   0.902
             SSU      0.874‡  0.880   0.889†  0.890†  0.893†  0.895†  0.898†  0.899   0.903   0.904
             CHI      0.860   0.882   0.893†  0.902†  0.905†  0.909   0.909   0.909   0.907   0.904
             SU       0.868   0.882   0.892†  0.898†  0.905   0.908   0.909   0.908   0.906   0.903
             ReliefF  0.855   0.882   0.891†  0.895†  0.900†  0.901†  0.901†  0.900   0.901   0.903

C4.5         MDFS     0.746   0.759   0.757   0.778   0.775   0.769   0.760   0.762   0.753   0.748
             MAUCD    0.747   0.763   0.776   0.777   0.768   0.760   0.756   0.754   0.751   0.748
             mRMR     0.767   0.769   0.750   0.742†  0.748†  0.750   0.740†  0.739†  0.744   0.743
             FSDD     0.775‡  0.763   0.769   0.766   0.744†  0.738†  0.759   0.753   0.746   0.744
             SCHI     0.807‡  0.770   0.767   0.768   0.772   0.761   0.765   0.749   0.751   0.749
             SSU      0.816‡  0.781‡  0.772   0.767   0.760   0.762   0.759   0.762   0.760   0.752
             CHI      0.795‡  0.784‡  0.788‡  0.783   0.768   0.758   0.756   0.751   0.751   0.748
             SU       0.764   0.767   0.781‡  0.785   0.772   0.764   0.758   0.751   0.746   0.749
             ReliefF  0.719†  0.738   0.738†  0.740†  0.736†  0.736†  0.742   0.739†  0.738   0.748

1NN          MDFS     0.727   0.762   0.784   0.806   0.819   0.827   0.817   0.820   0.815   0.818
             MAUCD    0.753‡  0.763   0.770   0.785†  0.798†  0.799†  0.814   0.808†  0.812   0.815
             mRMR     0.753‡  0.756   0.767   0.792†  0.797†  0.805†  0.815   0.819   0.815   0.814
             FSDD     0.748   0.741†  0.771   0.773†  0.797†  0.810†  0.814   0.813   0.821   0.820
             SCHI     0.738   0.755   0.761†  0.760†  0.760†  0.752†  0.769†  0.768†  0.780†  0.787†
             SSU      0.747‡  0.760   0.758†  0.764†  0.775†  0.778†  0.778†  0.777†  0.798†  0.797†
             CHI      0.748‡  0.743   0.769   0.772†  0.781†  0.790†  0.799†  0.798†  0.799†  0.802
             SU       0.762‡  0.749   0.761†  0.772†  0.785†  0.799†  0.798†  0.800†  0.801   0.796†
             ReliefF  0.718   0.750   0.765   0.774†  0.769†  0.773†  0.785†  0.787†  0.788†  0.786†

SVM          MDFS     0.758   0.792   0.815   0.827   0.828   0.832   0.827   0.823   0.819   0.825
             MAUCD    0.785‡  0.794   0.775†  0.784†  0.798†  0.800†  0.800†  0.810   0.814   0.820
             mRMR     0.801‡  0.790   0.787†  0.793†  0.811†  0.823   0.830   0.823   0.830   0.837
             FSDD     0.797‡  0.783   0.784†  0.788†  0.792†  0.796†  0.806†  0.821   0.830   0.834
             SCHI     0.804‡  0.801   0.783†  0.791†  0.795†  0.806†  0.795†  0.800†  0.804   0.817
             SSU      0.797‡  0.798   0.784†  0.795†  0.790†  0.799†  0.806†  0.816   0.820   0.834
             CHI      0.795‡  0.808‡  0.796†  0.795†  0.800†  0.804†  0.810†  0.819   0.809   0.811†
             SU       0.801‡  0.818‡  0.802   0.805†  0.804†  0.806†  0.821   0.815   0.822   0.827
             ReliefF  0.762   0.781   0.781†  0.782†  0.789†  0.795†  0.794†  0.800†  0.806   0.805†
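
The significance markers in the tables above come from paired comparisons between MDFS and each other method over the repeated cross-validation results. The following is only a minimal sketch of how such a marker could be derived, assuming a two-sided Wilcoxon signed-rank test with the normal approximation and tie correction; the function names `wilcoxon_signed_rank` and `significance_marker` are hypothetical and are not taken from the paper.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test using the normal approximation.

    Returns (w_plus, w_minus, p_value). Pairs with zero difference are
    dropped, and tied absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    variance = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (min(w_plus, w_minus) - mean) / math.sqrt(variance)
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2.0)))
    return w_plus, w_minus, p_value


def significance_marker(mdfs_scores, other_scores, alpha=0.05):
    """Return 'worse', 'better', or '' for the other method relative to MDFS.

    'worse' corresponds to the dagger and 'better' to the double dagger
    in the table captions; alpha=0.05 matches the 95% confidence level.
    """
    w_plus, w_minus, p_value = wilcoxon_signed_rank(other_scores, mdfs_scores)
    if p_value >= alpha:
        return ""
    return "better" if w_plus > w_minus else "worse"
```

With 10 repetitions of 10-fold cross-validation, each method would contribute 100 paired MAUC values per classifier and subset size, a sample size for which the normal approximation is usually considered adequate.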
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Table 10: Average runtime (in seconds) of the compared methods on the 8 data sets. 



         Data Set
Method   ISOLET      MNIST       USPS        Phoneme     Washington  Indiana     Synthetic   Thyroid
MDFS     38.815      18.765      6.223       1.778       5.814       6.351       0.054       0.268
MAUCD    43.109      20.949      6.324       2.161       6.324       7.090       0.063       0.323
mRMR     773.153     716.781     114.117     60.939      204.143     153.125     0.578       22.945
FSDD     0.512       1.363       0.241       0.100       0.293       0.187       0.002       0.036
SCHI     170.276     76.982      33.007      10.615      26.889      24.107      0.282       1.000
SSU      174.870     82.301      34.468      10.603      27.745      24.539      0.273       0.973
CHI      18.624      8.068       2.796       3.417       5.112       3.839       0.100       0.293
SU       19.072      8.242       2.964       3.374       5.189       3.945       0.091       0.285
ReliefF  3.434       5.706       1.774       0.740       1.748       1.551       0.037       0.285



its AUC values on every binary sub-problem that consists of a pair of classes. Therefore, MDFS 
first decomposes a multi-class problem into a batch of binary sub-problems in a one-versus-one 
manner. Then, features are iteratively selected based on their utility on each sub-problem. Equal 
focus on every sub-problem is achieved by choosing one of them with equal probability in 
each iteration. 
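
The procedure summarized above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the utility of a feature on a sub-problem is assumed here to be its single-feature AUC on that pair of classes, and the names `pairwise_auc`, `rank_features`, and `select_features` are hypothetical.

```python
import random
from itertools import combinations


def pairwise_auc(pos_values, neg_values):
    """Mann-Whitney estimate of the AUC; ties count as one half."""
    wins = 0.0
    for p in pos_values:
        for n in neg_values:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_values) * len(neg_values))


def rank_features(X, y, class_a, class_b):
    """Rank feature indices for the sub-problem class_a vs. class_b, best first."""
    n_features = len(X[0])
    utility = []
    for f in range(n_features):
        a_vals = [row[f] for row, label in zip(X, y) if label == class_a]
        b_vals = [row[f] for row, label in zip(X, y) if label == class_b]
        auc = pairwise_auc(a_vals, b_vals)
        # Assumption: a feature is equally useful whichever class scores higher.
        utility.append(max(auc, 1.0 - auc))
    return sorted(range(n_features), key=lambda f: -utility[f])


def select_features(X, y, k, seed=0):
    """Select up to k features by repeatedly visiting a random sub-problem."""
    rng = random.Random(seed)
    pairs = list(combinations(sorted(set(y)), 2))  # one-versus-one decomposition
    rankings = {pair: rank_features(X, y, *pair) for pair in pairs}
    k = min(k, len(X[0]))
    selected = []
    while len(selected) < k:
        pair = rng.choice(pairs)  # every sub-problem has equal probability
        for f in rankings[pair]:  # best feature of this sub-problem not yet taken
            if f not in selected:
                selected.append(f)
                break
    return selected
```

In this sketch the per-sub-problem rankings are computed once, so each iteration only costs one random draw and one scan of a ranking.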

The advantage of MDFS over traditional filter methods has been justified by comparative 
studies on 8 benchmark data sets. Results obtained with 4 types of classifiers demonstrated that 
MDFS is overall superior to the 8 other compared filter methods in terms of the MAUC values 
of classification systems. Experimental studies also showed that the direct use of MAUC as 
the feature ranking metric led to inferior performance compared to MDFS. Finally, the computational 
complexity of MDFS is comparable to that of most compared filter feature selection methods. 
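
For reference, the MAUC values discussed here can be computed from class probability estimates. The sketch below assumes the pairwise-averaged definition of Hand and Till, in which MAUC is the mean over all class pairs of the two directional AUC values; the function names `auc` and `mauc` and the input layout are hypothetical choices made for illustration.

```python
from itertools import combinations


def auc(pos_scores, neg_scores):
    """Mann-Whitney estimate of the AUC; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mauc(probs, y):
    """Multi-class AUC averaged over all pairs of classes.

    probs[i][c] is the estimated probability that instance i belongs to
    class c, and y[i] is the true class of instance i, given as a column
    index into probs[i].
    """
    classes = sorted(set(y))
    total = 0.0
    for a, b in combinations(classes, 2):
        # A(a|b): separate classes a and b using the estimated p(a).
        a_given_b = auc(
            [p[a] for p, t in zip(probs, y) if t == a],
            [p[a] for p, t in zip(probs, y) if t == b],
        )
        # A(b|a): separate the same two classes using the estimated p(b).
        b_given_a = auc(
            [p[b] for p, t in zip(probs, y) if t == b],
            [p[b] for p, t in zip(probs, y) if t == a],
        )
        total += 0.5 * (a_given_b + b_given_a)
    n_pairs = len(classes) * (len(classes) - 1) / 2
    return total / n_pairs
```

Because every pair of classes contributes equally to the average, a selection method that serves only some pairs well cannot reach a high MAUC, which is consistent with the equal focus on sub-problems described above.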

MDFS might be further improved in two respects. First, the random strategy that MDFS employs 
to select sub-problems does not take the possible correlation between sub-problems into 
account. If a number of sub-problems are highly correlated with one another, many more 
features would be selected for them than for the other sub-problems. This would lead to 
redundant features and make the weights of the sub-problems unequal in feature selection, which 
may lead to inferior performance of MAUC-oriented classification systems. Hence, finding a 
way to detect this correlation among sub-problems, or the relative importance of different 
sub-problems in maximizing MAUC, is of great interest. Second, redundancy among features has not 
been considered in MDFS. Given the great success of some redundancy-exclusive strategies 
OlTIl, incorporating them into MDFS may yield enhanced performance. We will 
investigate these two issues in the future. 
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